In this project I use the No-Show Appointment dataset to analyse and answer the following questions:
1) Is there a relationship between age (younger or older) and whether people show up for their appointments?
2) Is there a relationship between the combination of age and scholarship status and whether people show up for their appointments?
'''Importing all the required packages'''
import pandas as pd # For working with dataset
import numpy as np # For working with numerical data
import matplotlib.pyplot as plt
import seaborn as sns #For better visualisation
%matplotlib inline
"""Defining function for the program"""
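def show(dataframe, kind):
    # Return the requested view of the dataframe ('kind' avoids shadowing the built-in type)
    if kind == "data":
        return dataframe.head()
    if kind == "info":
        return dataframe.info()  # info() prints its summary and returns None
    if kind == "desc":
        return dataframe.describe()
    if kind == "size":
        return dataframe.shape
    # Fail loudly instead of hitting an undefined variable on an unknown option
    raise ValueError(f"Unknown view: {kind!r}")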
When I first tried to read this CSV file I got the error “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte”. To solve this issue I pass encoding='latin-1' when reading the CSV file.
I load the dataset as df, which is short to type and also hints at the term dataframe. df contains the full dataset as it is available in the raw file.
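To make the encoding issue concrete, here is a minimal sketch using a made-up byte string (not taken from the dataset). The byte 0xed is a valid Latin-1 character but, followed by a plain ASCII letter, it is not valid UTF-8, which reproduces the same kind of error at position 3.

```python
# Made-up bytes for illustration: "María" encoded as Latin-1.
raw = b"Mar\xeda"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0xed starts a multi-byte UTF-8 sequence, but the next byte is not a continuation byte
    utf8_error = exc

# Latin-1 maps every possible byte to a character, so this decode cannot fail
decoded = raw.decode("latin-1")

print(utf8_error)
print(decoded)
```

Because Latin-1 never raises a decode error, it can also silently produce wrong characters if the file was actually written in some other encoding, so it is worth checking a few text columns by eye after loading.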
DATA_URL='F:\\Github_repo\\Data-Analyst-Investigate-Dataset\\Data_set.csv'
encoding_type='latin-1'
df = pd.read_csv(DATA_URL, encoding=encoding_type)
df.head()
show(df,"desc")
show(df,"info")
Taking a closer look at some aspects of the data shown above. The next few cells look into the maximum listed age of 115 and the invalid age of -1. First, create a subset of the 'df' dataframe where the 'Age' column equals 115.
age_max= df.query("Age == 115")
age_max
age_max['PatientId'].nunique() # Here I'm using the nunique() function of pandas to know the no. of unique values in the desired dataframe.
age_min= df.query("Age == -1")
df["Age"] = df["Age"].replace(-1, 0)  # assign back; a chained in-place replace may not update df in newer pandas versions
show(df,"desc")
show(df,"info")
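The age checks above can be sketched on a tiny made-up dataframe. Only the column names 'PatientId' and 'Age' come from the dataset; the rows and the name toy are invented for illustration. It shows why counting rows and counting unique patients can give different answers, and how the -1 value is replaced.

```python
import pandas as pd

# Made-up rows for illustration; patient 1 appears twice with Age 115.
toy = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 4],
    "Age": [115, 115, 115, -1, 30],
})

oldest = toy.query("Age == 115")
n_rows = len(oldest)                        # number of appointments at age 115
n_patients = oldest["PatientId"].nunique()  # number of distinct patients at age 115

# Replace the invalid age and assign the result back to the column
toy["Age"] = toy["Age"].replace(-1, 0)

print(n_rows, n_patients, toy["Age"].min())
```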
Does age (younger or older) have any kind of impact on or relationship with people showing up for appointments or not?
To explore the above question I'm copying the 'df' dataframe to a new dataframe named 'df_new' so that I keep a backup of the original dataframe.
Now, using the 'df_new' dataframe, I'm creating a boxplot with the "Age" column on the x-axis and "No-show" on the y-axis.
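'''Here I am plotting a boxplot to analyse the relationship between "Age" and whether people showed up for their appointments'''
df_new = df.copy()  # .copy() makes an independent backup; plain assignment would only create a second name for df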
sns.boxplot(x="Age", y="No-show", palette=["g", "r"], data=df_new).set_title('Age Distribution Split by No-Show Category');
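A boxplot is easier to read next to the numbers it is drawn from. As a rough sketch, the same comparison can be summarised with a groupby. The rows below are made up for illustration; only the column names 'Age' and 'No-show' mirror the dataset.

```python
import pandas as pd

# Made-up rows for illustration, not real appointment data.
toy = pd.DataFrame({
    "Age": [5, 20, 35, 60, 70, 10, 25, 30],
    "No-show": ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
})

# Count, mean, quartiles and extremes of Age within each No-show group
age_by_status = toy.groupby("No-show")["Age"].describe()

# The median is the line drawn inside each box of a boxplot
median_age = toy.groupby("No-show")["Age"].median()

print(age_by_status)
print(median_age)
```

Running the same two groupby lines on df_new would give the actual quartiles behind the plot.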
Does the combination of age and scholarship have any impact on or relationship with people showing up for appointments or not?
To answer this question I'm using the sns.catplot() function of the seaborn library to plot a bar graph of Age by Scholarship status, split by whether people showed up for their appointments.
sns.catplot(x="Scholarship", y="Age", col="No-show", data=df_new, height=6, kind="bar", palette="muted");
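With kind="bar", seaborn draws the mean of the y variable for each category, so the bar heights can be cross-checked with a plain groupby. This is a sketch on made-up rows; only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration, not real appointment data.
toy = pd.DataFrame({
    "No-show":     ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
    "Scholarship": [0,    0,    1,    1,    0,     0,     1,     1],
    "Age":         [40,   60,   20,   30,   30,    50,    10,    20],
})

# Mean Age for each (No-show, Scholarship) pair; one row per No-show group
mean_age = (
    toy.groupby(["No-show", "Scholarship"])["Age"]
    .mean()
    .unstack("Scholarship")
)

print(mean_age)
```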
Is there any relationship between diabetes and whether people show up for their appointments?
This question helps us to know whether patients suffering from diabetes are conscious about their appointments or not. To answer this question I'm using the sns.catplot() function of the seaborn library to plot Age against Diabetes status, split by whether people showed up for their appointments. With no kind argument, catplot draws a strip plot rather than a bar graph.
sns.set(style="ticks")
sns.catplot(x="Diabetes", y="Age" , col="No-show", data=df_new);
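Since both 'Diabetes' and 'No-show' are categorical, one possible numeric check is the share of missed appointments within each Diabetes group. The following is only a sketch on made-up rows, using pd.crosstab with normalize="index" so that each row sums to 1.

```python
import pandas as pd

# Made-up rows for illustration, not real appointment data.
toy = pd.DataFrame({
    "Diabetes": [0, 0, 0, 0, 1, 1],
    "No-show":  ["No", "No", "No", "Yes", "No", "Yes"],
})

# Row-normalised table: fraction of "No" and "Yes" within each Diabetes group
rate = pd.crosstab(toy["Diabetes"], toy["No-show"], normalize="index")

print(rate)
```

On the real dataframe, similar values in the "Yes" column for both rows would be consistent with diabetes having little relationship with attendance.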
Result
1) Overall, there wasn't a huge difference in age between those who did and didn't show up to appointments. I believe the difference might have been clearer if the group who were present for their appointments were not nearly 4 times larger than the group who weren't.
2) While the age differences aren't very wide, the people who didn't show up to appointments tended to be younger, and this holds whether or not they had healthcare scholarships. Note also that the No-show=Yes group is nearly 4 times smaller than the other group.
3) Overall, after performing all the tasks, I can conclude that there is no clear evidence that either Age or Scholarship status has a strong relationship with whether people were present for their appointments. The much smaller number of people who missed their appointments, compared with those who attended, might also be a reason for this result.
4) Using the current dataset, I also tried to find a relationship between diabetes, age and the No-show status of the patients. It seems that there isn't any relationship between Age and the disease, and that it has no visible impact on appointments.
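The "nearly 4 times" imbalance mentioned in the results can be measured directly with value_counts. This sketch uses a made-up column with 8 "No" and 2 "Yes" values, chosen only so that the ratio comes out at 4; it is not the real distribution.

```python
import pandas as pd

# Made-up column for illustration: 8 attended ("No"), 2 missed ("Yes").
toy = pd.DataFrame({"No-show": ["No"] * 8 + ["Yes"] * 2})

counts = toy["No-show"].value_counts()                # absolute group sizes
shares = toy["No-show"].value_counts(normalize=True)  # group sizes as fractions
ratio = counts["No"] / counts["Yes"]                  # how many times larger the "No" group is

print(counts)
print(shares)
print(ratio)
```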
Limitations:
1) Given that Scholarship only has 0 or 1 as possible values, it was tough to find good visuals that would work with Scholarship and still provide some insight and be easy to understand.
2) Many of the columns hold categorical data, which makes them more difficult to analyze and visualize. This in turn somewhat hinders the ability to find any strong correlations between columns.
3) Again, the unbalanced split between the No-show Yes and No-show No groups didn't allow for a truly balanced or equal analysis, but at the same time this uneven split showed some potentially interesting areas that could be explored further.
4) The data in this dataset is widely scattered, and it looks like there isn't any relationship between the columns examined.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])